Without a 'doubt'? 
Unsupervised discovery of downward-entailing operators 
Abstract

An important part of textual inference is mak- 
ing deductions involving monotonicity, that 
is, determining whether a given assertion en- 
tails restrictions or relaxations of that asser- 
tion. For instance, the statement 'We know the 
epidemic spread quickly' does not entail 'We
know the epidemic spread quickly via fleas', 
but 'We doubt the epidemic spread quickly' 
entails 'We doubt the epidemic spread quickly 
via fleas'. Here, we present the first algorithm 
for the challenging lexical-semantics prob- 
lem of learning linguistic constructions that, 
like 'doubt', are downward entailing (DE). 
Our algorithm is unsupervised, resource-lean, 
and effective, accurately recovering many DE 
operators that are missing from the hand- 
constructed lists that textual-inference sys- 
tems currently use. 

Publication venue: NAACL HLT 2009 



1 Introduction 

Making inferences based on natural-language statements
is a crucial part of true natural-language understanding,
and thus has many important applications. As the field of
NLP has matured, there has been a resurgence of interest
in creating systems capable of making such inferences, as
evidenced by the activity surrounding the ongoing sequence
of "Recognizing Textual Entailment" (RTE) competitions
(Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo
et al., 2007) and the AQUAINT knowledge-based evaluation
project (Crouch et al., 2005).

The following two examples help illustrate the
particular type of inference that is the focus of this
paper.

1. 'We know the epidemic spread quickly'

2. 'We doubt the epidemic spread quickly'

A relaxation of 'spread quickly' is 'spread'; a
restriction of it is 'spread quickly via fleas'. From
statement 1, we can infer the relaxed version, 'We
know the epidemic spread', whereas the restricted
version, 'We know the epidemic spread quickly via
fleas', does not follow. But the reverse holds for
statement 2: it entails the restricted version 'We
doubt the epidemic spread quickly via fleas', but not
the relaxed version. The reason is that 'doubt' is a
downward-entailing operator [1]; in other words, it
allows one to, in a sense, "reason from sets to subsets"
(van der Wouden, 1997, pg. 90).

Downward-entailing operators are not restricted
to assertions about belief or to verbs. For example,
the preposition 'without' is also downward entailing:
from 'The applicants came without payment or
waivers' we can infer that all the applicants came
without payment. (Contrast this with 'with', which,
like 'know', is upward entailing.) In fact, there are
many downward-entailing operators, encompassing
many syntactic types; these include explicit negations
like 'no' and 'never', but also many other
terms, such as 'refuse (to)', 'preventing', 'nothing',
'rarely', and 'too [adjective] to'.

[1] Synonyms for "downward entailing" include downward-
monotonic and monotone decreasing. Related concepts include
anti-additivity, veridicality, and one-way implicatives.
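
The contrast between 'know' and 'doubt' can be stated compactly. The following is a standard formulation, added here as an orientation aid and not quoted from the text above; the symbols f, A, and B are introduced only for this illustration, with the arrow read as "entails".

```latex
% f is upward entailing (e.g., 'We know the epidemic ...'):
A \Rightarrow B \quad\text{implies}\quad f(A) \Rightarrow f(B)

% f is downward entailing (e.g., 'We doubt the epidemic ...'):
A \Rightarrow B \quad\text{implies}\quad f(B) \Rightarrow f(A)
```

Taking A = 'spread quickly via fleas' and B = 'spread quickly' (so that A entails B), the second line with f = 'We doubt the epidemic ...' yields exactly the inference from statement 2 to its restricted version.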



As the prevalence of these operators indicates and
as van der Wouden (1997, pg. 92) states, downward
entailment "plays an extremely important role
in natural language" (van Benthem, 1986; Hoeksema,
1986; Sanchez Valencia, 1991; Dowty, 1994;
MacCartney and Manning, 2007). Yet to date, only
a few systems attempt to handle the phenomenon
in a general way, i.e., to consider more than simple
direct negation (Nairn et al., 2006; MacCartney
and Manning, 2008; Christodoulopoulos, 2008;
Bar-Haim et al., 2008). These systems rely on lists
of items annotated with respect to their behavior in
"polar" (positive or negative) environments. The
lists contain a relatively small number of downward-
entailing operators, at least in part because they were
constructed mainly by manual inspection of verb
lists (although a few non-verbs are sometimes also
included). We therefore propose to automatically
learn downward-entailing operators [2] (henceforth
DE operators for short) from data; deriving more
comprehensive lists of DE operators in this manner
promises to substantially enhance the ability
of textual-inference systems to handle
monotonicity-related phenomena.

Summary of our approach There are a number
of significant challenges to applying a learning-based
approach. First, to our knowledge there do
not exist DE-operator-annotated corpora, and moreover,
relevant types of semantic information are "not
available in or deducible from any public lexical
database" (Nairn et al., 2006). Also, it seems there
is no simple test one can apply to all possible
candidates; van der Wouden (1997, pg. 110) remarks, "As
a rule of thumb, assume that everything that feels
negative, and everything that [satisfies a condition
described below], is monotone decreasing. This rule
of thumb will be shown to be wrong as it stands; but
it sort of works, like any rule of thumb."



[2] We include superlatives ('tallest'), comparatives ('taller'),
and conditionals ('if') in this category because they have
non-default (i.e., non-upward entailing) properties; for instance,
'he is the tallest father' does not entail 'he is the tallest man'.
Thus, they also require special treatment when considering
entailment relations. In fact, there have been some attempts
to unify these various types of non-upward entailing
operators (von Fintel, 1999). We use the term downward entailing
(narrowly-defined) (DE(ND)) when we wish to specifically
exclude superlatives, comparatives, and conditionals.



Our first insight into how to overcome these challenges
is to leverage a finding from the linguistics
literature, Ladusaw's hypothesis, which can be
treated as a cue regarding the distribution of DE op- 
erators: it asserts that a certain class of lexical con- 
structions known as negative polarity items (NPIs) 
can only appear in the scope of DE operators. Note 
that this hypothesis suggests that one can develop 
an unsupervised algorithm based simply on check- 
ing for co-occurrence with known NPIs. 

But there are significant problems with apply- 
ing this idea in practice, including: (a) there is no 
agreed-upon list of NPIs; (b) terms can be ambigu- 
ous with respect to NPI-hood; and (c) many non-DE 
operators tend to co-occur with NPIs as well. To 
cope with these issues, we develop a novel unsuper- 
vised distillation algorithm that helps filter out the 
noise introduced by these problems. This algorithm 
is very effective: it is accurate and derives many DE 
operators that do not appear on pre-existing lists. 

Contributions Our project draws a connection be- 
tween the creation of textual entailment systems and 
linguistic inquiry regarding DE operators and NPIs, 
and thus relates to both language-engineering and 
linguistic concerns. 

To our knowledge, this work represents the first 
attempt to aid in the process of discovering DE oper- 
ators, a task whose importance we have highlighted 
above. At the very least, our method can be used 
to provide high-quality raw materials to help human 
annotators create more extensive DE operator lists. 
In fact, while previous manual-classification efforts 
have mainly focused on verbs, we retrieve DE oper- 
ators across multiple parts of speech. Also, although 
we discover many items (including verbs) that are 
not on pre-existing manually-constructed lists, the 
items we find occur frequently — they are not some- 
how peculiar or rare. 

Our algorithm is surprisingly accurate given that it 
is quite resource- and knowledge-lean. Specifically, 
it relies only on Ladusaw's hypothesis as initial in- 
spiration, a relatively short and arguably noisy list 
of NPIs, and a large unannotated corpus. It does 
not use other linguistic information — for exam- 
ple, we do not use parse information, even though 
c-command relations have been asserted to play a 
key role in the licensing of NPIs (van der Wouden,
1997).



2 Method 

We mentioned in the introduction some significant 
challenges to developing a machine-learning ap- 
proach to discovering DE operators. The key insight 
we apply to surmount these challenges is that in the 
linguistics literature, it has been hypothesized that 
there is a strong connection between DE operators 
and negative polarity items (NPIs), which are terms 
that tend to occur in "negative environments". An 
example NPI is 'anymore': one can say 'We don't 
have those anymore' but not '*We have those any- 
more'. 

Specifically, we propose to take advantage of the
seminal hypothesis of Ladusaw (1980) (influenced by
Fauconnier (1975), inter alia):



(Ladusaw) NPIs only appear within the 
scope of downward-entailing operators. 

This hypothesis has been actively discussed,
updated, and contested by multiple parties (Linebarger,
1987; von Fintel, 1999; Giannakidou, 2002, inter
alia). It is not our intent to comment (directly) on its
overall validity. Rather, we simply view it as a very
useful starting point for developing computational
tools to find DE operators; indeed, even detractors
of the theory have called it "impressively
algorithmic" (Linebarger, 1987, pg. 361).

First, a word about scope. For Ladusaw's hypothesis,
scope should arguably be defined in terms of
c-command, immediate scope, and so on (von Fintel,
1999, pg. 100). But for simplicity and to make our
approach as resource-lean as possible, we simply
assume that potential DE operators occur to the left of
NPIs [3], except that we ignore text to the left of any
preceding commas or semi-colons as a way to
enforce a degree of locality. For example, in both 'By
the way, we don't have plants anymore_NPI because
they died' and 'we don't have plants anymore_NPI',
we look for DE operators within the sequence of
words 'we don't have plants'. We refer to such
sequences in which we seek DE operators as NPI
contexts.
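
The context-extraction rule just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `npi_contexts`, the abbreviated `NPIS` list, and the regular-expression tokenizer are all hypothetical choices made for the example (in particular, this tokenizer keeps "don't" as one token, whereas the method treats "n't" as a separate item).

```python
import re

# Hypothetical, abbreviated NPI list used only for this illustration.
NPIS = ["anymore", "at all", "yet", "ever", "any"]

def npi_contexts(sentence, npis=NPIS):
    """Return one NPI context (a list of lowercased tokens) per NPI occurrence.

    The context is the text to the left of the NPI, after discarding
    everything up to and including the last preceding comma or semicolon.
    """
    # Simplistic tokenizer (an assumption): words with apostrophes, plus , and ;
    tokens = re.findall(r"[\w']+|[,;]", sentence.lower())
    contexts = []
    for npi in npis:
        npi_tokens = npi.split()
        n = len(npi_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == npi_tokens:
                left = tokens[:i]
                # Index of the last comma/semicolon to the left, or -1 if none.
                cut = max(
                    (j for j, tok in enumerate(left) if tok in {",", ";"}),
                    default=-1,
                )
                contexts.append(left[cut + 1:])
    return contexts

if __name__ == "__main__":
    print(npi_contexts("By the way, we don't have plants anymore because they died"))
```

Both example sentences from the text produce the same context under this sketch, since the material before the comma and after the NPI is ignored.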



[3] There are a few exceptions, such as with the NPI 'for the
life of me' (Hoeksema, 1993).



Now, Ladusaw's hypothesis suggests that we can 
find DE operators by looking for words that tend to 
occur more often in NPI contexts than they occur 
overall. We formulate this as follows: 



Assumption: For any DE operator $d$,
$F_{\mathrm{byNPI}}(d) > F(d)$.

Here, $F_{\mathrm{byNPI}}(d)$ [4] is the number of occurrences of $d$
in NPI contexts divided by the number of words
in NPI contexts, and $F(x)$ refers to the number of
occurrences of $x$ relative to the number of words in
the corpus.

An additional consideration is that we would like
to focus on the discovery of novel or non-obvious
DE operators. Therefore, for a given candidate DE
operator $c$, we compute $\hat{F}_{\mathrm{byNPI}}(c)$: the value of
$F_{\mathrm{byNPI}}(c)$ that results if we discard all NPI
contexts containing a DE operator on a list of 10
well-known instances, namely, 'not', 'n't', 'no', 'none',
'neither', 'nor', 'few', 'each', 'every', and 'without'.
(This list is based on the list of DE operators used by
the RTE system presented in MacCartney and
Manning (2008).) This yields the following scoring
function:

$$S(c) = \frac{\hat{F}_{\mathrm{byNPI}}(c)}{F(c)} \qquad (1)$$
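
To make Equation 1 concrete, here is a minimal sketch of the score computation. It assumes that the corpus and the NPI contexts are already tokenized into lists of lowercased strings; the function name `score_candidates` and its arguments are hypothetical, and the frequency thresholds applied in the experiments are omitted.

```python
from collections import Counter

# The 10 well-known DE operators listed in the text.
WELL_KNOWN_DE = {"not", "n't", "no", "none", "neither",
                 "nor", "few", "each", "every", "without"}

def score_candidates(corpus_sentences, contexts, well_known=WELL_KNOWN_DE):
    """Compute S(c) = F^_byNPI(c) / F(c) for every candidate c.

    corpus_sentences: list of token lists covering the whole corpus.
    contexts: list of token lists, one per NPI context.
    """
    # F(c): occurrences of c in the corpus relative to corpus size in words.
    corpus_counts = Counter(tok for sent in corpus_sentences for tok in sent)
    corpus_words = sum(corpus_counts.values())

    # Discard NPI contexts that contain a well-known DE operator.
    kept = [ctx for ctx in contexts if not well_known.intersection(ctx)]
    context_words = sum(len(ctx) for ctx in kept)

    # Count each candidate at most once per NPI context.
    context_counts = Counter(tok for ctx in kept for tok in set(ctx))

    scores = {}
    for cand, count in context_counts.items():
        f_overall = corpus_counts[cand] / corpus_words
        if f_overall > 0:
            scores[cand] = (count / context_words) / f_overall
    return scores

if __name__ == "__main__":
    corpus = [["we", "doubt", "it", "ever", "rains"],
              ["we", "know", "it", "rains"],
              ["we", "know", "it"],
              ["we", "do", "not", "know", "it", "ever", "rains"]]
    ctxs = [["we", "doubt", "it"], ["we", "do", "not", "know", "it"]]
    print(score_candidates(corpus, ctxs))
```

In this toy corpus the second context is dropped because it contains 'not', so 'know' receives no score at all, while 'doubt' scores well above the function word 'we'.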



Distillation There are certain terms that are not 
DE operators, but nonetheless co-occur with NPIs as 
a side-effect of co-occurring with true DE operators 
themselves. For instance, the proper noun 'Milken' 
(referring to Michael Milken, the so-called "junk- 
bond king") occurs relatively frequently with the DE 
operator 'denies', and 'vigorously' occurs frequently 
with DE operators like 'deny' and 'oppose'. We
refer to terms like 'milken' and 'vigorously' as
"piggybackers", and address the piggybackers problem
by leveraging the following intuition: in general, we
do not expect to have two DE operators in the same
NPI context [5]. One way to implement this would be
to re-score the candidates in a winner-takes-all fash- 
ion: for each NPI context, reward only the candidate 



[4] Even if d occurs multiple times in a single NPI context we
only count it once; this way we "dampen the signal" of
function words that can potentially occur multiple times in a single
sentence.

[5] One reason is that if two DE operators are composed, they
ordinarily create a positive context, which would not license
NPIs (although this is not always the case (Dowty, 1994)).



with the highest score S. However, such a method 
is too aggressive because it would force us to pick 
a single candidate even when there are several with 
relatively close scores — and we know our score S is 
imperfect. Instead, we propose the following "soft" 
mechanism. Each sentence distributes a "budget" of 
total score 1 among the candidates it contains ac- 
cording to the relative scores of those candidates; 
this works out to yield the following new distilled 
scoring function 



$$S_d(c) = \frac{\sum_{\text{NPI contexts } p} \frac{S(c)}{n(p)}}{N(c)} \qquad (2)$$

where $n(p) = \sum_{c \in p} S(c)$ is an NPI-context
normalizing factor and $N(c)$ is the number of NPI
contexts containing the candidate $c$; the sum in the
numerator ranges over the NPI contexts $p$ that contain $c$.
This way, plausible candidates that have high $S$ scores
relative to the other candidates in the sentence receive
enhanced $S_d$ scores. To put it another way: apparently
plausible candidates that often appear in sentences with
multiple good candidates (i.e., piggybackers) receive a
low distilled score, despite a high initial score.

Our general claim is that the higher the distilled 
score of a candidate, the better its chances of being 
a DE operator. 
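
The "budget" mechanism of Equation 2 can be sketched as follows. This is an illustrative reading of the formula, assuming that `scores` maps each candidate to its initial score S(c) and that NPI contexts are token lists; the function name `distill` and the toy numbers in the example are hypothetical.

```python
from collections import defaultdict

def distill(scores, contexts):
    """Compute the distilled score S_d(c) for every scored candidate.

    scores: dict mapping candidate -> initial score S(c).
    contexts: list of token lists, one per NPI context.
    """
    budget_share = defaultdict(float)  # sum over contexts p of S(c) / n(p)
    num_contexts = defaultdict(int)    # N(c): contexts containing c

    for ctx in contexts:
        # Candidates present in this context (each counted once).
        cands = {tok for tok in ctx if tok in scores}
        n_p = sum(scores[c] for c in cands)  # normalizing factor n(p)
        if n_p == 0:
            continue
        for c in cands:
            # The context splits a total budget of 1 in proportion to S.
            budget_share[c] += scores[c] / n_p
            num_contexts[c] += 1

    return {c: budget_share[c] / num_contexts[c] for c in budget_share}

if __name__ == "__main__":
    s = {"denies": 4.0, "milken": 4.0, "rumor": 1.0}
    ctxs = [["milken", "denies"], ["she", "denies"], ["denies", "rumor"]]
    print(distill(s, ctxs))
```

In the toy example 'milken' and 'denies' start with the same score, but 'milken' only ever appears alongside 'denies', so it must always share the budget and ends with the lower distilled score.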



Choice of NPIs Our proposed method requires
access to a set of NPIs. However, there does not
appear to be universal agreement on such a set. Lichte
and Soehn (2007) mention some doubts regarding
approximately 200 (!) of the items on a roughly 350-
item list of German NPIs (Kürschner, 1983). For
English, the "moderately complete" [6] Lawler (2005)
list contains two to three dozen items; however,
there is also a list of English NPIs that is several
times longer (von Bergen and von Bergen, 1993,
written in German), and Hoeksema (1997) asserts
that English should have hundreds of NPIs, similarly
to French and Dutch.

We choose to focus on the items on these lists 
that seem most likely to be effective cues for our 
task. Specifically, we select a subset of the Lawler 
NPIs, focusing mostly on those that do not have 
a relatively frequent non-NPI sense. An example 
discard is 'much', whose NPI-hood depends on
what it modifies and perhaps on whether there
are degree adverbs pre-modifying it (Hoeksema,
1997). There are some ambiguous NPIs that we
do retain due to their frequency. For example,
'any' occurs both in a non-NPI "free choice"
variant, as in 'any idiot can do that', and in an
NPI version. Although it is ambiguous with
respect to NPI-hood, 'any' is also a very valuable
cue due to its frequency [7]. Here is our NPI list:



any            in weeks/ages/years   budge         yet
at all         drink a drop          red cent      ever
give a damn    last/be/take long     but what      bother to
do a thing     arrive/leave until    give a shit   lift a finger
bat an eye     would care/mind       eat a bite    to speak of



3 Experiments 

Our main set of evaluations focuses on the precision 
of our method at discovering new DE operators. We 
then briefly discuss evaluation of other dimensions. 

3.1 Setup 

We applied our method to the entirety of the BLLIP 
(Brown Laboratory for Linguistic Information Pro- 
cessing) 1987-89 WSJ Corpus Release 1, available 
from the LDC (LDC2000T43). The 1,796,379 sen- 
tences in the corpus comprise 53,064 NPI contexts; 
after discarding the ones containing the 10 well- 
known DE operators, 30,889 NPI contexts were left. 
To avoid sparse data problems, we did not consider 
candidates with very low frequency in the corpus
(< 150 occurrences) or in the NPI contexts (< 10
occurrences).

Methodology for eliciting judgments The obvi- 
ous way to evaluate the precision of our algorithm is 
to have human annotators judge each output item as 
to whether it is a DE operator or not. However, there 
are some methodological issues that arise. 

First, if the judges know that every term they are 
rating comes from our system and that we are hoping 
that the algorithm extracts DE operators, they may 
be biased towards calling every item "DE" regard- 
less of whether it actually is. We deal with this prob- 
lem by introducing distractors — items that are not 
produced by our algorithm, but are similar enough 
to not be easily identifiable as "fakes". Specifically, 



[6] www-personal.umich.edu/~jlawler/aue/npi.html

[7] It is by far the most frequent NPI, appearing in 36,554 of
the sentences in the BLLIP corpus (see Section 3).



for each possible part of speech of each of our sys- 
tem's outputs c that exists in WordNet, we choose a 
distractor that is either in a "sibling" synset (a hy- 
ponym of c's hypernym) or an antonym. Thus, the 
distractors are highly related to the candidates. Note 
that they may in fact also be DE operators. 

The judges were made aware of the presence of 
a substantial number of distractors (about 70 for the 
set of top 150 outputs). This design choice did seem 
to help ensure that the judges carefully evaluated 
each item. 

The second issue is that, as mentioned in the in- 
troduction, there does not seem to be a uniform test 
that judges can apply to all items to ascertain their 
DE-ness; but we do not want the judges to impro- 
vise excessively, since that can introduce undesir- 
able randomness into their decisions. We therefore 
encouraged the judges to try to construct sentences 
wherein the arguments for candidate DE operators 
were drawn from a set of phrases and restricted 
replacements we specified (example: 'singing' vs 
'singing loudly'). However, improvisation was still 
required in a number of cases; for example, the
candidate 'act', as either a noun or a verb, cannot take
'singing' as an argument.

The labels that the judges could apply were 
"DE(ND)" (downward entailing (narrowly- 
defined)), "superlative", "comparative", "condi- 
tional", "hard to tell", and "not-DE" (= none of the 
above). We chose this fine-grained sub-division 
because the second through fourth categories are 
all known to co-occur with NPIs. There is some 
debate in the linguistics literature as to whether 
they can be considered to be downward entailing,
narrowly construed, or not (von Fintel, 1999,
inter alia), but nonetheless, such operators call for
special reasoning quite distinct from that required 
when dealing with upward entailing operators — 
hence, we consider it a success when our algorithm 
identifies them. 

Since monotonicity phenomena can be rather sub- 
tle, the judges engaged in a collaborative process. 
Judge A (the second author) annotated all items, but 
worked in batches of around 10 items. At the end of
each batch, Judge B (the first author) reviewed Judge
A's decisions, and the two consulted to resolve
disagreements as far as possible.

One final remark regarding the annotation: some 



decisions still seem uncertain, since various factors 
such as context, Gricean maxims, what should be 
presupposed [8], and so on come into play. However,
we take comfort in a comment by Eugene Charniak 
(personal communication) to the effect that if a word 
causes a native speaker to pause, that word is inter- 
esting enough to be included. And indeed, it seems 
reasonable that if a native speaker thinks there might 
be a sense in which a word can be considered down- 
ward entailing, then our system should flag it as a 
word that an RTE system should at least perhaps 
pass to a different subsystem for further analysis. 

3.2 Precision Results 

We now examine the 150 items that were most
highly ranked by our system, which were
subsequently annotated as just described. (For
full system output that includes the unannotated
items, see http://www.cs.cornell.edu/~cristian.
We would welcome external annotation
help.) As shown in Figure 1a, which depicts
precision at k for various values of k, our system 
performs very well. In fact, 100% of the first 60 out- 
puts are DE, broadly construed. It is also interesting 
to note the increasing presence of instances that the 
judges found hard to categorize as we move further 
down the ranking. 
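
For readers who want to recompute the metric, precision at k over the ranked, annotated outputs can be sketched as below. The function name `precision_at_k` is hypothetical, and which labels count as hits is an assumption exposed as a parameter: here the goal categories are taken to be DE(ND), superlative, comparative, and conditional.

```python
# Assumed set of labels that count as successes ("DE, broadly construed").
GOAL_LABELS = {"DE(ND)", "superlative", "comparative", "conditional"}

def precision_at_k(labels, k, goal=GOAL_LABELS):
    """Fraction of the top-k ranked outputs whose label is in `goal`.

    labels: judge labels for the system outputs, in rank order.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = labels[:k]
    return sum(1 for label in top if label in goal) / len(top)

if __name__ == "__main__":
    # Toy ranking: 8 DE(ND) items, 2 related items, then 2 misses.
    ranked = ["DE(ND)"] * 8 + ["superlative", "conditional",
                               "not-DE", "hard to tell"]
    print(precision_at_k(ranked, 10))
    print(precision_at_k(ranked, 12))
```

Passing a larger `goal` set (for example, one that also includes "hard to tell") shows how sensitive the reported figure is to that choice.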

Of our 73 distractors, 46% were judged to be 
members of one of our goal categories. The fact that 
this percentage is substantially lower than our
algorithm's precision at both 73 and 150 (the largest k we
considered) confirms that our judges were not
making random decisions. (We expect the percentage
of DE operators among the distractors to be much
higher than 0 because they were chosen to be
similar to our system's outputs, and so can be expected
to also be DE operators some fraction of the time.)

Table 1 shows the lemmas of just the DE(ND)
operators that our algorithm placed in its top 150
outputs [9]. Most of these lemmas are new discoveries, in
the sense of not appearing in Ladusaw's (1980)
(implicit) enumeration of DE operators. Moreover, the



[8] For example, 'X doubts the epidemic spread quickly' might
be said to entail 'X would doubt the epidemic spreads via fleas,
presupposing that X thinks about the flea issue'.

[9] By listing lemmas, we omit variants of the same word, such
as 'doubting' and 'doubted', to enhance readability. We omit
superlatives, comparatives, and conditionals for brevity.




Figure 1: (a) Precision at k for k divisible by 10 up to k = 150. The bar divisions are, from the x-axis up,
DE(ND) (blue, the largest); Superlatives/Conditionals/Comparatives (green, 2nd largest); and Hard (red, sometimes
non-existent). For example, all of the first 10 outputs were judged to be either downward entailing (narrowly-defined)
(8 of 10, or 80%) or in one of the related categories (20%). (b) Precision at k when the distillation step is omitted.



not-DE 


Hard 


almost 


firmly 


one-day 


approve 


ambitious 


fined 


signal 


cautioned 


considers 


liable 


remove 


dismissed 


detect 


notify 


vowed 


fend 



Table 3: Examples of words judged to be either not in 
one of our monotonicity categories of interest (not-DE) 
or hard to evaluate (Hard). 



lists of DE(ND) operators that are used by textual-
entailment systems are significantly smaller than
that depicted in Table 1; for example, MacCartney
and Manning (2008) use only about a dozen
(personal communication).

Table 3 shows examples of the words in our
system's top 150 outputs that are either clear mistakes
or hard to evaluate. Some of these are due to
idiosyncrasies of newswire text. For instance, we
often see phrases like 'biggest one-day drop in ...',
where 'one-day' piggybacks on superlatives, and
'vowed' piggybacks on the DE operator 'veto', as
in the phrase 'vowed to veto'.



Effect of distillation In order to evaluate the
importance of the distillation process, we study how
the results change when distillation is omitted (thus
using as score function S from Equation 1 rather
than S_d). When comparing the results (summarized
in Figure 1b) with those of the complete system
(Figure 1a) we observe that the distillation indeed
has the desired effect: the number of highly ranked
words that are annotated as not-DE decreases after
distillation. This results in an increase of the
precision at k ranging from 5% to 10% (depending on k),
as can be observed by comparing the height of the
composite bars in the two figures [10].

Importantly, this improvement does indeed seem 
to stem at least in part from the distillation process 
handling the piggybacking problem. To give just a 
few examples: 'vigorously' is pushed down from 
rank 48 (undistilled scoring) to rank 126 (distilled 
scoring), 'one-day' from 25th to 65th, 'vowed' from
45th to 75th, and 'Milken' from 121st to 350th.

3.3 Other Results 

It is natural to ask whether the (expected) decrease 
in precision at k is due to the algorithm assigning 
relatively low scores to DE operators, so that they 
do not appear in the top 150, or due to there being
no more true DE operators to rank. We
cannot directly evaluate our method's recall because
no comprehensive list of DE operators exists.
However, to get a rough impression, we can check how
our system ranks the items in the largest list we are
aware of, namely, the Ladusaw (implicit) list
mentioned above. Of the 31 DE operator lemmas on this
list (not including the 10 well-known DE operators), 
only 7 of those frequent enough to be considered by 
our algorithm are not in its top 150 outputs, and only 



[10] The words annotated "hard" do not affect this increase in
precision.



absence of         to defer          hardly          premature to       to rule out    to veto
absent from        to deny (L)       to lack         to prevent         skeptical •    wary of
anxious about •    to deter          innocent of •   to prohibit        to suspend     warned that (L)
to avoid (L)       to discourage     to minimize •   rarely (L)         to thwart      whenever
to bar             to dismiss        never (L)       to refrain from    unable to      withstand
barely             to doubt (L)      nobody          to refuse (L)      unaware of
to block           to eliminate      nothing         regardless •       unclear on
cannot (L)         essential for •   to oppose       to reject          unlike
compensate for •   to exclude        to postpone •   reluctant to (L)   unlikely (L)
to decline         to fail (L)       to preclude     to resist          unwilling to

Table 1: The 55 lemmas for the 90 downward entailing (narrowly-defined) operators among our algorithm's top 150
outputs. (L) marks instances from Ladusaw's list. • marks some of the more interesting cases. We have added
function words (e.g., 'to', 'for') to indicate parts of speech or subcategorization; our algorithm does not discover
multi-word phrases.



Original                                               Restriction
Dan is _unlikely_ to sing.                     <=/=    Dan is unlikely to sing loudly.
Olivia _compensates for_ eating by exercising. <=/=    Olivia compensates for eating late by exercising.
Ursula _refused_ to sing or dance.             <=/=    Ursula refused to sing.
Bob would _postpone_ singing.                  <=/=    Bob would postpone singing loudly.
Talent is _essential for_ singing.             <=/=    Talent is essential for singing a ballad.
She will finish _regardless_ of threats.       <=/=    She will finish regardless of threats to my career.

Table 2: Example demonstrations that the underlined expressions (selected from Table 1) are DE operators: the
sentences on the left entail those on the right. We also have provided <=/= indicators because the reader might find it
helpful to reason in the opposite direction and see that these expressions are not upward entailing.



5 are not in the top 300. Remember that we only an- 
notated the top 150 outputs; so, there may be many 
other DE operators between positions 150 and 300. 

Another way of evaluating our method would be 
to assess the effect of our newly discovered DE op- 
erators on downstream RTE system performance. 
There are two factors to take into account. First, the 
DE operators we discovered are quite prevalent in 
naturally occurring text [11]: the 90 DE(ND) operators
appearing in our algorithm's top 150 outputs occur
in 111,456 sentences in the BLLIP corpus (i.e., in
6% of its sentences). Second, as previously
mentioned, systems do already account for monotonicity
to some extent, but they are limited by the fact
that their DE operator lexicons are restricted mostly
to well-known instances; to take a concrete example
with a publicly available RTE system: Nutcracker
(Bos and Markert, 2006) correctly infers that 'We
did not know the disease spread' entails 'We did not
know the disease spread quickly' but it fails to
infer that 'We doubt the disease spread' entails 'We
doubt the disease spread quickly'. So, systems can
use monotonicity information but currently do not 
have enough of it; our method can provide them with 
this information, enabling them to handle a greater 
fraction of the large number of naturally occurring 
instances of this phenomenon than ever before. 

4 Related work not already discussed 



[11] However, RTE competitions do not happen to currently
stress inferences involving monotonicity. The reasons why are
beyond the scope of this paper.



Magnini (2008), in describing modular approaches
to textual entailment, hints that NPIs may be used 
within a negation-detection sub-component. 

There is a substantial body of work in the
linguistics literature regarding the definition and nature of
polarity items (Polarity Items Bibliography).
However, very little of this work is computational. There
has been passing speculation that one might want
to learn polarity-inverting verbs (Christodoulopoulos,
2008, pg. 47). There have also been a few
projects on the discovery of NPIs, which is the
converse of the problem we consider. Hoeksema (1997)
discusses some of the difficulties with corpus-based
determination of NPIs, including "rampant"
polysemy and the problem of "how to determine
independently which predicates should count as
negative", a problem which our work addresses. Lichte
and Soehn (Lichte, 2005; Lichte and Soehn, 2007)
consider finding German NPIs using a method con- 
ceptually similar in some respects to our own, al- 
though again, their objective is the reverse of ours. 
Their discovery statistic for single-word NPIs is the 
ratio of within-licenser-clause occurrences to total 
occurrences, where, to enhance precision, the list of 
licensers was filtered down to a set of fairly unam- 
biguous, easily-identified items. They do not con- 
sider distillation, which we found to be an impor- 
tant component of our DE-operator-detection algo- 
rithm. Their evaluation scheme, unlike ours, did not 
employ a bias-compensation mechanism. They did 
employ a collocation-detection technique to extend 
their list to multi-word NPIs, but our independent 
experiments with a similar technique (not reported 
here) did not yield good results. 

5 Conclusions and future work 



To our knowledge, this work represents the first
attempt to discover downward entailing operators. We
introduced an unsupervised algorithm that is
motivated by research in linguistics but employs simple
distributional statistics in a novel fashion. Our algo- 
rithm is highly accurate and discovers many reason- 
able DE operators that are missing from pre-existing 
manually-built lists. 

Since the algorithm is resource-lean (requiring
no parser or tagger but only a list of NPIs), it can be
immediately applied to languages where such lists
exist, such as German and Romanian (Trawiński and
Soehn, 2008). On the other hand, although the
results are already quite good for English, it would
be interesting to see what improvements could be 
gained by using more sophisticated syntactic infor- 
mation. 

For languages where NPI lists are not extensive,
one could envision applying an iterative co-learning 
approach: use the newly-derived DE operators to in- 
fer new NPIs, and then discover even more new DE 
operators given the new NPI list. (For English, our 
initial attempts at bootstrapping from our initial NPI 
list on the BLLIP corpus did not lead to substantially 
improved results.) 



In practice, subcategorization is an important
feature to capture. In Table 1 we indicate which
subcategorizations are DE. An interesting extension of
our work would be to try to automatically
distinguish particular DE subcategorizations that are
lexically apparent, e.g., 'innocent' (not DE) vs.
'innocent of' (as in 'innocent of burglary', DE).

Our project provides a connection (among many) 
between the creation of textual entailment systems 
(the domain of language engineers) and the char- 
acterization of DE operators (the subject of study 
and debate among linguists). The prospect that our 
method might potentially eventually be refined in 
such a way so as to shed at least a little light on lin- 
guistic questions is a very appealing one, although 
we cannot be certain that any progress will be made 
on that front. 